feat: extend intrinsic matmul
#951
base: master
Conversation
Here are a few ideas for consideration:
pure function matmul_chain_order(p) result(s)
    integer, intent(in) :: p(:)
    integer :: s(1:size(p)-1, 2:size(p))
    integer :: m(1:size(p), 1:size(p))
    integer :: l, i, j, k, q, n
    n = size(p)
    ...
end function matmul_chain_order
pure module function stdlib_matmul(m1, m2, m3, m4, m5) result(e)
    real, intent(in) :: m1(:,:), m2(:,:)
    real, intent(in), optional :: m3(:,:), m4(:,:), m5(:,:) !> from the 3rd matrix onward they can all be optional
    real, allocatable :: e(:,:)
    integer :: p(5), i, num_present
    integer :: s(3,2:4)
    p(1) = size(m1, 1)
    p(2) = size(m2, 1)
    num_present = 2
    if(present(m3)) then
        p(3) = size(m3, 1)
        num_present = num_present + 1
    end if
    if(present(m4)) then
        p(4) = size(m4, 1)
        num_present = num_present + 1
    end if
    if(present(m5)) then
        p(5) = size(m5, 2)
        num_present = num_present + 1
    end if
    s = matmul_chain_order(p(1:num_present))
    ...
end function stdlib_matmul

For the procedure computing the orders you only need the array of dimensions p. For the main procedure, you could consider using optional arguments from the 3rd matrix onwards and then manage the rest within the same procedure, without recursion. Regarding the example, I would consider it much more interesting to set up a non-trivial example: one where the actual ordering is different from the naive sequence. Just some food for thought.
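To make the non-trivial-ordering point concrete, here is a small standalone illustration (not from this thread; all names and dimensions are invented) of a chain where the naive left-to-right order is far from optimal:

program chain_cost_demo
    implicit none
    ! hypothetical chain: A is 1000x2, B is 2x1000, C is 1000x2
    integer, parameter :: p(4) = [1000, 2, 1000, 2]
    integer :: cost_left, cost_right

    ! naive left-to-right order: (A*B)*C
    cost_left  = p(1)*p(2)*p(3) + p(1)*p(3)*p(4)
    ! alternative order: A*(B*C)
    cost_right = p(2)*p(3)*p(4) + p(1)*p(2)*p(4)

    print *, '(A*B)*C costs', cost_left,  'scalar multiplications'   ! 4000000
    print *, 'A*(B*C) costs', cost_right, 'scalar multiplications'   ! 8000
end program chain_cost_demo

Here the optimal ordering is roughly 500 times cheaper than the naive one, so such an example would clearly exercise the chain-order logic.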
Thank you very much @jalvesz for your detailed remarks, they are quite helpful. Regarding the signature I am a bit confused, because if we don't use recursion we would have to handle 14 cases (the 4th Catalan number) when the number of matrices is 5, which would become quite messy quickly. For 3 matrices it is much more efficient to compare the two ordering costs and dispatch an ordering appropriately (numpy claims that this is about 15 times more efficient than calculating a full chain ordering). And yes, I will add some non-trivial examples for sure. Looking forward to hearing your thoughts.
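For reference, a hedged sketch of the kind of 3-matrix dispatch described above (the name matmul_three and its internals are hypothetical; the PR's actual implementation may differ):

function matmul_three(a, b, c) result(d)
    implicit none
    real, intent(in) :: a(:,:), b(:,:), c(:,:)
    real, allocatable :: d(:,:)
    integer :: m, k, n, q

    m = size(a, 1); k = size(a, 2)
    n = size(b, 2); q = size(c, 2)

    ! compare the two possible orderings, counted in scalar multiplications
    if (m*k*n + m*n*q <= k*n*q + m*k*q) then
        d = matmul(matmul(a, b), c)   ! (A*B)*C is cheaper
    else
        d = matmul(a, matmul(b, c))   ! A*(B*C) is cheaper
    end if
end function matmul_three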
Indeed! Another idea: for 3 and 4 matrices, make them internal procedures, passing as an extra argument the ordering slice corresponding to those matrices. For the one public procedure with 2 mandatory + 3 optional matrices, you compute the ordering just once and call the internal versions for 3 and 4 depending on the case.
Thank you @jalvesz, I have implemented them accordingly and added some better examples, although I am not sure how I may test the correctness of multiplication with large matrices, or compare the time taken against the native matmul.
In terms of correctness I would say that the simplest approach would be to write down a couple of analytical cases for which the exact solution is fixed, plus a couple of cases with random matrices for which the optimal ordering is different from the trivial one, and compare: stdlib_matmul vs trivial sequential calls to the intrinsic matmul. For integer arrays the error should be zero; for real/complex arrays the error tolerance should be somewhere around machine precision. In terms of performance, it might not be necessary to add that to the PR, but a separate small program showcasing a couple of scenarios might be useful.
@loiseaujc maybe you have some use cases worth testing?
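A minimal sketch of such a correctness check, using random matrices whose optimal ordering differs from the trivial one (the module hosting stdlib_matmul is an assumption here, as is the exact tolerance):

program check_stdlib_matmul
    use stdlib_intrinsics, only: stdlib_matmul   ! module name assumed
    implicit none
    real :: a(200, 5), b(5, 200), c(200, 5)      ! optimal order is a*(b*c), not (a*b)*c
    real, allocatable :: ref(:,:), res(:,:)
    real :: tol

    call random_number(a); call random_number(b); call random_number(c)

    ref = matmul(matmul(a, b), c)   ! trivial sequential intrinsic calls
    res = stdlib_matmul(a, b, c)    ! chain-ordered version from this PR

    ! deliberately loose tolerance, scaled by the magnitude of the result
    tol = sqrt(epsilon(1.0)) * maxval(abs(ref))
    if (maxval(abs(res - ref)) > tol) error stop 'stdlib_matmul mismatch'
    print *, 'max abs difference:', maxval(abs(res - ref))
end program check_stdlib_matmul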
Oh boy! I had not seen this PR. Super happy with it. I'll take some time this afternoon to look at the code and give you some feedback. As for the test cases, anything involving multiplications with a low-rank matrix already factored would do. As far as I'm concerned, these are the typical cases I encounter where such an interface would prove very useful syntactically. I'll see, as I get to the lab, if I have a sufficiently simple yet somewhat realistic set of data we could use for testing the performance.
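As a concrete flavour of that use case, consider a low-rank factor applied to a block of vectors, where the chain order matters a lot (dimensions, names, and the hosting module are all invented for illustration):

program lowrank_demo
    use stdlib_intrinsics, only: stdlib_matmul   ! module name assumed
    implicit none
    real, allocatable :: U(:,:), Vt(:,:), X(:,:), Y(:,:)

    allocate(U(5000, 20), Vt(20, 5000), X(5000, 50))
    call random_number(U); call random_number(Vt); call random_number(X)

    ! (U*Vt)*X forms a 5000x5000 intermediate (~1.75e9 multiplications);
    ! U*(Vt*X) never does (~1e7). The chain-order logic should pick the latter.
    Y = stdlib_matmul(U, Vt, X)
    print *, shape(Y)
end program lowrank_demo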
Very good progress @wassup05.
n = p(start + 3)
k = p(start + 2)
allocate(temp(m,n))
call gemm('N', 'N', m, n, k, one_${s}$, m2, m, m3, k, zero_${s}$, temp, m)
Very good progress @wassup05, thank you! Imho this PR is almost ready to be merged. As you suggest, it would be good to have a nice wrapper for gemm. It has been discussed before. I would like to suggest that all calls to gemm are also wrapped into a stdlib_matmul function - now with two matrices only. This would give stdlib fully functional matmul functionality.
Here I suggest two possible APIs, and I will ask @jalvesz @jvdp1 @loiseaujc to discuss that together:
The first would be similar to gemm:

! API similar to gemm
${t1}$ function stdlib_matmul(A, Astate, B, Bstate) result(C)
    ${t1}$, intent(in) :: A(:,:), B(:,:)
    character, intent(in), optional :: Astate, Bstate
and could use the matrix state definitions already in use for the sparse operations:

stdlib/src/stdlib_sparse_constants.fypp, lines 16 to 18 in 5c64ee6:

character(1), parameter :: sparse_op_none = 'N' !! no transpose
character(1), parameter :: sparse_op_transpose = 'T' !! transpose
character(1), parameter :: sparse_op_hermitian = 'H' !! conjugate or hermitian transpose
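For concreteness, a possible (untested) shape of this first API for the double-precision case, assuming the gemm generic from stdlib_linalg_blas and dp from stdlib_kinds; the specific name stdlib_matmul_rdp is only illustrative:

function stdlib_matmul_rdp(A, Astate, B, Bstate) result(C)
    use stdlib_kinds, only: dp
    use stdlib_linalg_blas, only: gemm
    implicit none
    real(dp), intent(in) :: A(:,:), B(:,:)
    character, intent(in), optional :: Astate, Bstate
    real(dp), allocatable :: C(:,:)
    character(1) :: opA, opB
    integer :: m, n, k

    opA = 'N'; if (present(Astate)) opA = Astate
    opB = 'N'; if (present(Bstate)) opB = Bstate

    ! op(A) is m x k and op(B) is k x n ('N' keeps the array, 'T' transposes it)
    m = merge(size(A, 1), size(A, 2), opA == 'N')
    k = merge(size(A, 2), size(A, 1), opA == 'N')
    n = merge(size(B, 2), size(B, 1), opB == 'N')

    allocate(C(m, n))
    call gemm(opA, opB, m, n, k, 1.0_dp, A, size(A, 1), B, size(B, 1), 0.0_dp, C, m)
end function stdlib_matmul_rdp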
The second would be more ambitious and essentially zero-overhead: it would wrap the operation in a derived type (to be templated, of course):
type :: matrix_state_rdp
    real(dp), pointer :: A(:,:) => null()
    character(1) :: Astate = 'N'
end type matrix_state_rdp

interface transposed
    module procedure transposed_new_rdp
    ...
end interface transposed

type(matrix_state_rdp) function transposed_new_rdp(A) result(AT)
    real(dp), intent(inout), target :: A(:,:)
    AT%A => A
    AT%Astate = 'T'
end function
Then we could define a templated base interface:

! Work with normal matrices
${t1}$ function stdlib_matmul(A, B) result(C)
    ${t1}$, intent(in) :: A(:,:), B(:,:)

! Work with transposed/hermitian swaps
${rt}$ function stdlib_matmul(A, B) result(C)
    type(matrix_state_${rn}$), intent(in) :: A, B
So it would be clear to the user writing code:

C = stdlib_matmul(A, B)
C = stdlib_matmul(transposed(A), B)
C = stdlib_matmul(A, hermitian(C))
we could even make it an operator:
C = stdlib_matmul(A, B)
C = stdlib_matmul(.t.A, B)
C = stdlib_matmul(A, .h.C)
without it triggering any actual data movement.
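A hedged sketch of how such operators could be declared (not part of this PR). One caveat: defined-operator functions must take intent(in) arguments, which conflicts with the intent(inout), target declaration in the transposed_new_rdp sketch above, so that constructor would need some rethinking before this form compiles; hermitian_new_rdp is a hypothetical complex counterpart.

interface operator(.t.)
    module procedure transposed_new_rdp
end interface

interface operator(.h.)
    module procedure hermitian_new_rdp
end interface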
This PR attempts to address #931
An interface stdlib_matmul is created and extended to handle 3-5 matrices (works for integer, real, complex).

API
The algorithm for optimal parenthesization is the one given in "Introduction to Algorithms" by Cormen et al. (4th edition, Chapter 14, Section 2).
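For readers without the book at hand, a minimal sketch of that dynamic-programming recurrence (illustrative only; names and array bounds need not match the PR's matmul_chain_order, which uses its own conventions):

pure function chain_order(p) result(s)
    ! matrix i has shape p(i) x p(i+1); s(i, j) stores the split index k at which
    ! the product A_i..A_j is best divided into (A_i..A_k) * (A_{k+1}..A_j)
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, intent(in) :: p(:)
    integer :: s(size(p)-1, size(p)-1)
    integer(int64) :: m(size(p)-1, size(p)-1), cost
    integer :: n, l, i, j, k

    n = size(p) - 1            ! number of matrices in the chain
    m = 0_int64
    s = 0
    do l = 2, n                ! l = length of the sub-chain
        do i = 1, n - l + 1
            j = i + l - 1
            m(i, j) = huge(cost)
            do k = i, j - 1
                cost = m(i, k) + m(k+1, j) + int(p(i), int64)*p(k+1)*p(j+1)
                if (cost < m(i, j)) then
                    m(i, j) = cost
                    s(i, j) = k
                end if
            end do
        end do
    end do
end function chain_order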
numpy's linalg.multi_dot also uses the same algorithm.

Although, as @jalvesz had suggested, I should have used gemm for the multiplication of the component matrices, this uses matmul for now; I can make that change if deemed appropriate once the major implementation has been given a green signal.

I have added a very basic example to play around with, and I will be adding the detailed docs, specs and tests once everybody approves of the major implementation.
I am not really happy with some parts of the code, like computing the size of all the matrices individually. If anyone has suggestions regarding that, or a cleaner way of implementing some of the other parts (perhaps some fypp magic), please do let me know.
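One possibility for the fypp question (purely illustrative, not code from this PR): generate the per-matrix bookkeeping with a fypp loop instead of spelling each case out by hand, for example

#:for i in range(3, 6)
    if (present(m${i}$)) then
        p(${i}$) = size(m${i}$, 1)
        num_present = num_present + 1
    end if
#:endfor

with the trailing dimension of the last present matrix still appended to p separately.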
Notes